Method and system for extraction and organizing selected data from sources on a network

ABSTRACT

Described is a system and method for employing user created database-structured queries and data extraction engines to crawl through Websites extracting and organizing data from selected sources on a network, such as the Internet. The structure of a query processed by a Data Extraction engine enables a user to treat the network as a searchable database. The database-structured queries provide a user with tools to match patterns on selected sites on the network. A user may automate database-structured queries to be executed on a regular frequency. Output of the database-structured queries may be placed into a data log, displayed on a user display screen, or optionally reshaped for use by a plurality of data analysis tools. Additionally, an optional graphical user interface is provided.

RELATED APPLICATION

[0001] This utility patent application is a continuation of previously filed U.S. provisional patent applications, U.S. Ser. No. 60/197,076 filed on Apr. 13, 2000, and U.S. Ser. No. 60/262,721 filed on Jan. 18, 2001, the benefit of the filing dates of which is hereby claimed under 35 U.S.C. §119(e).

FIELD OF THE INVENTION

[0002] The invention relates generally to providing data from a network, and more particularly to the extracting and organizing of selected data from sources on a network.

BACKGROUND OF THE INVENTION

[0003] The World Wide Web has been recognized as a vast reservoir of information. There are literally terabytes of highly valuable documents and other files on the Internet and other networks.

[0004] Such a vast resource provides businesses, researchers, and consumers with information never available to them in the past. However, while a vast quantity of information is available on the Internet, finding information with sufficient precision to address specific questions has remained difficult and expensive.

[0005] Attempts have been made to provide tools that will assist users in locating information on the Internet. The more common tools include search engines that crawl network sites. These software programs are often programmed to follow universal resource locator (URL) links, collecting information from the Websites they visit.

[0006] While these search tools provide a much-needed service to the Internet user, they remain limited in their usefulness. For example, today's businesses seek more than simply a location for general information. Businesses desire the ability to use the network to answer dynamic strategic marketing questions, monitor competitors, identify new opportunities, and analyze trends. Unfortunately, while the current search tools provide listings or pointers to locations on the Internet that may have helpful information, the information is often not at a level of precision necessary to answer today's complex business questions.

[0007] In the past, due to the complexity of the business questions being asked, businesses have had to pay large numbers of employees to manually execute multiple search engines, manually aggregate the results, and then manually extract the relevant data from those results. Finally, employees would have to format the extracted data so that it could be used by the business. While this may provide businesses with more precise results, it remained an overwhelming, expensive, and slow approach to finding answers to complex business questions.

[0008] Alternatively, businesses have expended massive amounts of time and labor in developing custom single query software programs in an attempt to take advantage of the information available on the Internet, and improve the precision of the searches. The development of these custom single queries is often long and tedious, and requires continual labor to monitor the results. Because these software programs are often written to address a particular business question, businesses must continually invest large quantities of money for each novel question raised. The result is that businesses must invest heavily in maintaining skilled programmers and computing resources.

[0009] Finally, businesses that have invested heavily in commercially available analysis software programs seek to take advantage of those programs to analyze the results from Internet queries. However, the information from the queries is typically not in a format that the analysis programs can readily use.

SUMMARY OF THE INVENTION

[0010] The present invention is directed at providing a system and method for creating and using database-structured queries for extracting data from a network, such as the Internet.

[0011] According to one aspect of the invention, a database-structured query is used to extract data from a network, such as the Internet. A database-structured query is created that treats the content on the network as a searchable database. Data is extracted from the web domain address based on the database-structured query.

[0012] According to another aspect of the invention, a database-structured query is created having regular expressions used to locate and extract data from the network.

[0013] According to yet another aspect of the invention, a database-structured query for extracting data from web domains having content is created containing a request to follow links within the web domain address. Links are followed until the links have been exhausted or until a predetermined limit is reached during execution of the database-structured query.

[0014] According to another aspect of the invention, a text editor within a client may be used to create the database-structured query. The database-structured query may be created from a template of regular expressions that may be used to extract data from the network.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] FIG. 1A is a schematic block diagram illustrating an embodiment of a Customer Intranet Data Extraction Services System;

[0016] FIG. 1B is a schematic block diagram illustrating an embodiment of a Third Party Intranet Data Extraction Services System;

[0017] FIG. 2 is a schematic block diagram illustrating a client-server embodiment of a Data Extraction System;

[0018] FIG. 3 is a flow diagram illustrating an overview of a process for extracting and organizing selected data on a network;

[0019] FIG. 4 is a flow diagram illustrating a process for crawling non-cached database entries of URLs;

[0020] FIG. 5 is a flow diagram illustrating a process for parsing HTML content for data extraction; and

[0021] FIG. 6 is a schematic block diagram illustrating an embodiment of a Data Extraction Client system environment.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0022] In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings, which form a part hereof, and in which are shown, by way of illustration, specific exemplary embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the invention is defined only by the appended claims. Referring to the drawings, like numbers indicate like parts throughout the views.

[0023] The invention is directed at providing a method and system for enabling users to treat the information residing on a network like a database by developing database-structured queries to extract and organize data from sources on the network. The term “database-structured queries” includes any of a plurality of stylized forms written to interrogate related files for information where the contents of the files are organized in such a way that a computer program may choose (or select) distinct pieces of the information. The terms database-structured query and query may be used interchangeably.

System Operating Environment

[0024] FIG. 1A shows a schematic block diagram illustrating an overview of a Customer Intranet Data Extraction Services System 100. As shown in the figure, the Data Extraction Services System 100 consists of Internet 102, Websites 104 (1 through N), Internet Data Extraction (DE) Authentication and Update Services 106, Data Extraction (DE) Clients 108, Data Extraction (DE) Services 110, and customer Intranet 112. It should be noted that while the embodiment of the invention in FIG. 1A employs an Internet 102 and Intranet 112, the invention is not so limited. The Internet 102 and Intranet 112 may be replaced by similar network configurations without departing from the spirit or scope of the invention.

[0025] Data Extraction Clients 108, Data Extraction (DE) Services 110, and Websites 104 (1 through N) are coupled to the Internet 102. The Internet 102 provides a communication path between the Data Extraction Clients 108 and the Data Extraction Services 110. The Internet 102 also provides a communication path between the Data Extraction Services 110 and the Websites 104 (1 through N).

[0026] In the embodiment shown in FIG. 1A, the Data Extraction Services 110 exist within an Intranet 112. Data Extraction Clients 108 optionally may exist within the Intranet 112 and couple to the Data Extraction Services 110.

[0027] The Internet 102 refers to the worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (“TCP/IP”) suite of protocols to communicate with one another. The Internet is composed of a backbone of high-speed data communication lines between major nodes or host computers, including thousands of commercial, government, educational, and other computer systems, that route data and messages. According to one embodiment of the invention, the Internet 102 may be generalized to any network structure such as local area networks (LANs), wide area networks (WANs), or direct connections, such as through a universal serial bus (USB) port, or any combination thereof.

[0028] An Intranet 112 is a computing network based on TCP/IP protocols, and is typically used by businesses. Typically, Intranet 112 is accessible only by the business's members, employees, or those with authorization. Because Intranet 112 uses substantially similar communications protocols and hypertext links as the Internet 102, it provides a way of disseminating information internally to a business and extending the business worldwide.

[0029] The Customer Intranet Data Extraction Services System 100 may also use communication media that embody computer readable instructions, data structures, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and that include any information delivery media. A modulated data signal is a signal in which one or more characteristics are set or changed so as to encode information in the signal. For example, communication media include wired/wireless networks, acoustic, RF, infrared and other wireless media.

[0030] Remote Data Extraction Clients 108, perhaps administered by the Intranet 112 customer, would communicate with the DE Services 110 through the Internet 102.

[0031] One possible application of the embodiment illustrated in FIG. 1A would provide for a customer or business to administer its own Intranet 112. Additionally, the Internet DE Authentication and Update Services 106 and Intranet 112 may be administered by different businesses.

[0032] Data Extraction Clients 108 request execution or scheduling of a user's database-structured queries. The Data Extraction Services 110 determine if the Data Extraction Clients' 108 requests are authorized. The Internet DE Authentication and Update Services 106 returns an encrypted message to the Data Extraction Services 110 granting or denying the Data Extraction Clients' 108 requests.

[0033] Once authorization for execution has been granted by the Internet DE Authentication and Update Services 106, the Data Extraction Services 110 perform the database-structured query by crawling through Websites 104 (1 through N), extracting and organizing data according to the query. Data that matches a user's query is made available to the Data Extraction Clients 108, a data analysis software program (not shown), or a plurality of other applications a user may specify.

[0034] FIG. 1B is a schematic block diagram illustrating a Third Party Intranet Data Extraction Services System 120 according to an embodiment of the invention. The system as shown in FIG. 1A is substantially similar to the system as shown in FIG. 1B. The Internet DE Authentication and Update Services 106 from FIG. 1A, however, has been removed and replaced by the Intranet DE Authentication and Update Services 122 within Third Party Intranet 114. The optional Data Extraction Clients 108 have also been removed in FIG. 1B.

[0035] As shown in FIG. 1B, the Intranet DE Authentication and Update Services 122 is coupled to the Data Extraction Services 110. Third Party Intranet 114 is coupled to the Internet 102.

[0036] The Third Party Intranet Data Extraction Services System 120 shown in FIG. 1B operates in substantially the same manner as the embodiment illustrated in FIG. 1A. However, while FIG. 1A illustrates the Internet DE Authentication and Update Services 106 as communicating through the Internet 102 to the Data Extraction Services 110, FIG. 1B illustrates the Intranet DE Authentication and Update Services 122 as communicating directly to the Data Extraction Services 110. Optionally, the Intranet DE Authentication and Update Services 122 may communicate through the Third Party Intranet 114 to the Data Extraction Services 110 when the two services reside on different computing devices within the same Intranet. FIG. 1B, then, illustrates an embodiment where a Third Party Provider might administer both the Intranet DE Authentication and Update Services 122 and Data Extraction Services 110 for remote Data Extraction Clients 108.

[0037] FIG. 2 is a schematic block diagram illustrating a client-server architecture Data Extraction System 200 according to one embodiment of the invention. As shown in the figure, the Data Extraction System 200 consists of Data Extraction Clients 108, a Data Extraction Enterprise Server 202, a Web Server 204, batch files 206, and Data Extraction engines 208 (1 through N).

[0038] A client-server architecture, sometimes called a two-tier architecture, may consist of a network of hardware and software in which each computing device or software process is designated as either a client or a server. Servers may be hardware computing devices or software processes that have been dedicated to managing storage devices, printers, or even network traffic, or to executing specific programs (or processes). Clients are typically hardware computing devices on which a user would execute application programs. A Client device would employ servers for resources or optionally for the execution of specific programs.

[0039] While FIG. 2 employs a client-server computing architecture, the disclosed invention is not so limited. In light of this disclosure, it will be recognized by one skilled in the art that the invention may take advantage of a plurality of other computing architectures.

[0040] As shown in FIG. 2, the Data Extraction Clients 108 are coupled to the Data Extraction Enterprise Server 202. The Data Extraction Enterprise Server 202 is coupled to each of the Data Extraction engines 208 (1 through N). Web Server 204 and batch files 206 are also coupled to each of the Data Extraction engines 208 (1 through N).

[0041] The Data Extraction Clients 108 (FIG. 1A) communicate database-structured query requests to the Data Extraction Enterprise Server 202. The Data Extraction Enterprise Server 202 may administer the Data Extraction engines 208 by employing Data Extraction engines 208 to execute a user's database-structured query. Optionally, a user experienced in developing Web pages may employ Web Server 204 to administer Data Extraction engines 208 to request execution of a database-structured query. Similarly, users experienced in programming scripts may employ batch files 206 to administer the Data Extraction engines 208 to execute database-structured queries on the network. Additionally, a combination of the above approaches may be employed.

[0042] One or more Data Extraction engines 208 perform the requested database-structured queries, providing the results to a data log (not shown). The Data Extraction engines 208 communicate completion status to the administrating program once the query is completed. The user may then perform analysis on the results of the query. Where the Data Extraction engines 208 were launched through Web Server 204, the user may optionally display the results to a user's browser web page (not shown).

Overview of Database-structured Network Queries

[0043] The disclosed invention employs a database-structured query language to treat content on a network as a searchable database. Briefly, sets of query conditions (clauses) are created that are used with network crawlers (software programs) to traverse specified Website domains and Website content.

[0044] Referring to FIG. 2, Data Extraction engines 208 employ database-structured regular expressions to scan Website content and return matched data to a tab delimited data file. Because a regular expression may use literal characters such as “1234” and symbolic characters such as “[1-4]” to describe patterns of strings to match, a user is provided a flexible set of tools to develop patterns.

[0045] As an example, the following regular expression could be used to extract a telephone number in Website content:

Phone\s([-\d]+)

[0046] where \s would match tabs, spaces, new lines, or carriage returns, \d would match numeric digits, and the + would be used to match a series of one or more of the previous element (in this example, the character class of hyphens and numeric digits). The ( ) symbols denote the part of the regular expression that a user desires to extract from the Website content. Finally, the [ ] define character classes to match. In the above example, the pattern matches until the next character is not a hyphen or a digit. Therefore, this regular expression example could return the phone number “1-800-124-5679” from Website content such as “Phone 1-800-124-5679”.
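By way of a non-limiting illustration, the behavior of this pattern may be checked with a short Python sketch. The sketch assumes an ASCII hyphen inside the character class and no literal space between \s and the capture group; the function name extract_phone and the sample content are hypothetical and are not part of the described system.

```python
import re
from typing import Optional

# Pattern from the text, assuming an ASCII hyphen in the character class
# and no literal space between \s and the capture group.
PHONE_PATTERN = re.compile(r"Phone\s([-\d]+)")


def extract_phone(content: str) -> Optional[str]:
    """Return the captured run of digits and hyphens, or None when absent."""
    match = PHONE_PATTERN.search(content)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical Website content used only for demonstration.
    sample = "<p>Sales desk: Phone 1-800-124-5679 (toll free)</p>"
    print(extract_phone(sample))  # prints 1-800-124-5679
```

The capture group stops at the space before "(toll free)" because a space is neither a hyphen nor a digit.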

[0047] A typical database-structured query may contain a plurality of fundamental components or clauses. Three examples of fundamental query clauses include a select clause, a from clause, and a where clause.

[0048] A select clause is employed where a user desires a specified regular expression pattern to be matched during a search of a network page.

[0049] A from clause provides the network locations, such as universal resource locators (URLs), where the Data Extraction engines 208 begin a search.

[0050] A where clause contains conditions describing how Data Extraction engines 208 are to search networks for relevant data.

[0051] A general database-structured query format might look similar to:

[0052] select

[0053] [functions with regular expressions]

[0054] from

[0055] [network address (URL)]

[0056] where

[0057] [conditions]
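As one possible illustration of this format, the following Python sketch splits a query script into its three clauses. It assumes that each clause keyword begins a line, which matches the examples in this description but is not stated as a rule; the names Query and parse_query are hypothetical, and the sketch is not the parser of the described Data Extraction engines.

```python
import re
from dataclasses import dataclass


@dataclass
class Query:
    select: str   # data extraction functions with regular expressions
    source: str   # body of the "from" clause ("from" is a Python keyword)
    where: str    # crawl conditions, empty when the clause is omitted


# Assumption: a clause keyword appears at the start of a line.
CLAUSE_RE = re.compile(r"^\s*(select|from|where)\b[ \t]*",
                       re.IGNORECASE | re.MULTILINE)


def parse_query(text: str) -> Query:
    """Split a query script into its select, from, and where bodies."""
    parts = {"select": "", "from": "", "where": ""}
    matches = list(CLAUSE_RE.finditer(text))
    for i, m in enumerate(matches):
        # A clause body runs until the next keyword or the end of the script.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        parts[m.group(1).lower()] = text[m.end():end].strip()
    return Query(parts["select"], parts["from"], parts["where"])


if __name__ == "__main__":
    script = (
        "select\n"
        '  text("President", "", "", "sT")\n'
        "from\n"
        "  http://www.domainname.com/page.html\n"
    )
    print(parse_query(script))
```

A clause body that itself contains a line starting with one of the keywords would be split incorrectly; a fuller parser would need to tokenize quoted arguments first.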

[0058] The select clause may provide for a plurality of data extraction functions that take advantage of regular expressions to describe data patterns. Possible data extraction functions provide for text string and table pattern matches. Optionally, complex database-structured queries providing for logic flow control, logical tests, and even variable manipulation may be employed.

[0059] An example of a text database-structured query that a user might employ to find the name of the U.S. President could look similar to the following:

[0060] select

[0061] text(“President\s([A-Z]\w+\s[A-Z]\w+)”, “”, “”, “sT”)

[0062] from

[0063] http://www.whitehouse.gov/WH/EOP/html/principals.html

[0064] where the first argument in the text function is a regular expression to be matched during the database-structured query. In this database-structured query, principals.html is scanned for the match:

[0065] President\s([A-Z]\w+\s[A-Z]\w+)

[0066] In the regular expression, [A-Z] matches a single capital letter and is followed by \w, which matches an alphanumeric character, followed by +. The + is a metacharacter (a character provided with additional significance), used to match one or more of the preceding element. Therefore, \w+ means one or more alphanumeric characters that immediately follow a capital letter. The \s that follows is used to match a space character. This is followed by a second [A-Z]\w+. When combined, this regular expression matches two words separated by a space, each beginning with a capital letter. Because the text function in this example includes the word President and a space, it may be used to find a first name, space, last name string preceded by the word President.
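A brief Python sketch, offered only as an illustration, shows this pattern returning every match in a block of content and formatting the matches as tab delimited lines of the kind mentioned earlier. The function names and the sample sentence are hypothetical. Note that in Python \w also matches the underscore and non-ASCII letters, which may differ from the engine described here.

```python
import re

# Pattern from the text, assuming no literal spaces around the group.
NAME_PATTERN = re.compile(r"President\s([A-Z]\w+\s[A-Z]\w+)")


def extract_names(content):
    """Return every 'First Last' string that follows the word President."""
    # With a single capture group, findall returns the group contents.
    return NAME_PATTERN.findall(content)


def to_tab_delimited(names):
    """Format each match as one line: first name, tab, last name."""
    return "\n".join("\t".join(name.split()) for name in names)


if __name__ == "__main__":
    # Hypothetical content used only for demonstration.
    sample = ("President George Washington and President John Adams; "
              "the president spoke.")
    print(to_tab_delimited(extract_names(sample)))
```

The lowercase "president" in the sample is not matched, because the literal text in the pattern is case sensitive.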

[0067] Because tables are one of the principal tools used in layout design of Web pages, a plurality of table functions are provided. Table functions provide a way to search data intensive network sites, where most pages share the same layout, and the data of interest to a user is presented in tabular form.

[0068] One table function, tables( ), may be used to return the contents of each cell in a table on a single, usually long line. The following example of the table function might be employed to gather stock quotes for stocks listed in a file:

[0069] tables(“Last\sTrade”, “”, “tn”)

[0070] where the first argument, “Last\sTrade”, is the regular expression to be matched within a desired table, the second argument (here, the empty string, “”) might specify the depth or number of layers deep to extract nested tables, and the third argument, “tn”, identifies options on the table search. For example, the “tn” argument might denote deletion of null cells and of HTML tags after the HTML scan.
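The following Python sketch suggests one way such a table function could behave, using only the standard library HTML parser. It is a simplified illustration under stated assumptions: nested tables are not handled, the drop_empty flag is a hypothetical stand-in for the null cell option, and HTML tags inside cells are discarded because only text data is collected. It is not the tables( ) function of the described engines.

```python
import re
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect cell text grouped by table (nested tables are not handled)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of cell strings per table
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag in ("td", "th") and self.tables:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.tables[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def tables(pattern, html, drop_empty=True):
    """Return one tab delimited line per table with a cell matching pattern."""
    parser = TableCollector()
    parser.feed(html)
    parser.close()
    lines = []
    for cells in parser.tables:
        if drop_empty:
            cells = [c for c in cells if c]
        if any(re.search(pattern, c) for c in cells):
            lines.append("\t".join(cells))
    return lines


if __name__ == "__main__":
    # Hypothetical page content used only for demonstration.
    page = ("<table><tr><th>Symbol</th><th>Last Trade</th><th></th></tr>"
            "<tr><td>ABC</td><td><b>12.50</b></td></tr></table>"
            "<table><tr><td>Other</td></tr></table>")
    print(tables(r"Last\sTrade", page))
```

Only the first table is returned in this demonstration, since the second table has no cell matching the pattern.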

[0071] The from clause contains the network address, typically as a URL, of a network location from where the Data Extraction engines 208 start to crawl in their search for matching patterns. A possible entry in a from clause might look similar to the following:

[0072] http://www.domainname.com/page.html.

[0073] The where clause employs a plurality of functions to specify how the Data Extraction engines 208 crawl the network. In one possible approach, the where clause may employ a follow links function.

[0074] The follow links function instructs Data Extraction engines 208 to follow the links (URLs) initially provided and then follow additional links on each of the Web pages linked to, until every link has been followed, or, optionally, to a user specified depth. By way of an example, the following might be used to follow relative URL links to a depth of two (2), starting at a network address contained within a predetermined from clause:

[0075] where

[0076] approach=followlinks (“”, “”, “relative”, “2”)

[0077] where the first argument in the followlinks function could be a regular expression denoting where a scan starts. The second argument could be a regular expression defining where a scan would stop. In this example, the first two arguments are left empty and therefore are not used. The third argument provides the type of links to crawl. In the example above, relative links are those links located on a network “relative” to the current location of the Web page being viewed. The fourth argument (here, two) denotes the number of layers deep to crawl. If the depth argument is left empty, the engine will crawl without a depth limit (that is, until all links have been crawled). As the engine follows the links, it keeps track of visited Web pages, so it will visit a particular Web page only once no matter how many times that page is linked to.
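The depth limit and the record of visited pages may be illustrated by a minimal breadth-first crawl in Python. In this sketch the callback get_links is a hypothetical stand-in for retrieving a page and scanning it for links, and the interpretation of depth (the start page is depth zero, and pages at the maximum depth are visited but their links are not followed) is an assumption rather than a statement of how the described engines count layers.

```python
from collections import deque


def follow_links(start_url, get_links, max_depth=None):
    """Return pages in visiting order, each visited at most once.

    get_links(url) returns the URLs linked from that page (hypothetical
    callback). max_depth=None follows links until none remain.
    """
    visited = {start_url}
    order = [start_url]
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # page is visited, but its links are not followed
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


if __name__ == "__main__":
    # Hypothetical in-memory link graph standing in for a Website.
    site = {"/a": ["/b", "/c"], "/b": ["/a", "/d"],
            "/c": ["/d"], "/d": ["/e"], "/e": []}
    print(follow_links("/a", lambda u: site.get(u, []), max_depth=2))
    print(follow_links("/a", lambda u: site.get(u, [])))
```

In the demonstration, page /d is reachable from both /b and /c but appears only once, and page /e is reached only when no depth limit is given.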

[0078] A user optionally may employ a sequence approach function in the where clause. The sequence function may increment numbers by a user selected step value, substituting each value into a URL string. The URL string is employed as a new network address to be searched. A user may define a number to start at, a number to stop the search at, and an amount to increment by as arguments to the function. A fourth argument may define the substitution symbol, to denote the position in the from clause where the values will be substituted. By way of illustration, the following script might be used to retrieve matching data from the calendar pages of a network site for January of 1999:

[0079] select

[0080] [data of interest on a page]

[0081] from http://www.calendarsite/sequence/99/01/#%.html

[0082] where

[0083] approach=sequence(“0”,“3”,“1”,“#”)(“0”,“9”,“1”,“%”)
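As a hedged illustration of how the two sequence specifications above might expand into URLs, the following Python sketch substitutes every combination of values into the template. It assumes that the stop value is inclusive and that each substitution symbol appears only where a value is intended; the function name sequence_urls is hypothetical.

```python
from itertools import product


def sequence_urls(template, *specs):
    """Expand a URL template from (start, stop, step, symbol) specifications.

    Assumption: the stop value is inclusive, so ("0", "9", "1", "%")
    yields the ten digits 0 through 9.
    """
    ranges = [range(int(start), int(stop) + 1, int(step))
              for start, stop, step, _ in specs]
    symbols = [symbol for _, _, _, symbol in specs]
    urls = []
    for values in product(*ranges):
        url = template
        for symbol, value in zip(symbols, values):
            url = url.replace(symbol, str(value))
        urls.append(url)
    return urls


if __name__ == "__main__":
    urls = sequence_urls(
        "http://www.calendarsite/sequence/99/01/#%.html",
        ("0", "3", "1", "#"),
        ("0", "9", "1", "%"),
    )
    print(len(urls), urls[0], urls[-1])
```

Under these assumptions the expansion covers page names 00.html through 39.html, which is a superset of the calendar days in January; pages that do not exist would simply return no matches.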

[0084] Another possible approach function that a user may employ is a list function. A list function may iterate through a predefined list of words, substituting each item into a URL string at a place marked by a substitution symbol. The result is to automatically change the URL every time, substituting values specified from the list. For example, the following might retrieve a list of stock quotes:

[0085] select

[0086] tables(“last”, “0”, “nt”) (“Chart”) (“Symbol”, “Change”,“Volume”, “More Info”)

[0087] from

[0088] http://finance.website.com/q?s=####

[0089] where

[0090] approach=list(“stock_symbols.txt”, “####”)
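The substitution performed by such a list function may be sketched in Python as follows. The sketch reads items one per line from any iterable of lines, so an in-memory buffer stands in here for a file such as stock_symbols.txt; the function name list_urls, the one-item-per-line file layout, and the percent-encoding of items are assumptions made for illustration.

```python
import io
from urllib.parse import quote


def list_urls(template, lines, symbol):
    """Substitute each non-blank item into the template at the symbol."""
    urls = []
    for line in lines:
        item = line.strip()
        if item:
            # quote() percent-encodes characters that are unsafe in a URL.
            urls.append(template.replace(symbol, quote(item)))
    return urls


if __name__ == "__main__":
    # Hypothetical contents of a symbols file, one item per line.
    symbols_file = io.StringIO("ABC\nXYZ\n\nQRS\n")
    for url in list_urls("http://finance.website.com/q?s=####",
                         symbols_file, "####"):
        print(url)
```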

[0091] Such functions and clauses provide a user with a database-structured query language that may be employed to create searches for a network, such as the Internet. However, the invention is not limited to the specific functions described within.

Illustrative Embodiment of Data Extraction Client system

[0092] FIG. 6 is a schematic block diagram illustrating a Data Extraction Client system environment 600 according to an embodiment of the invention. As shown in the figure, the Data Extraction Client system environment 600 includes Data Extraction Clients 108 and Data Extraction Services 110. Data Extraction Clients 108 are coupled to Data Extraction Services 110. Data Extraction Clients 108 include a Chat window 602, a graphical user interface (GUI) dialog window 604, a web browser 606, a Point and Click query generator 608, Help and Syntax tools 610, and a text editor 616. The text editor 616 consists of a save query 612, a load query 614, and an optional (shown as a dashed box) gamma query 618. The optional gamma query 618 couples and communicates with the save query 612.

[0093] The GUI dialog window 604 is coupled to the Chat window 602, web browser 606, and Help and Syntax tools 610. The Help and Syntax tools 610 are coupled to the Point and Click query generator 608 and to the text editor 616.

[0094] As shown in text editor 616, the gamma query 618 couples to the load query 614 and save query 612. In embodiments without the gamma query 618, the text editor 616 would invoke and communicate with the save query 612 and the load query 614.

[0095] The Data Extraction Clients 108 provide a set of user interface windows for a user to create, schedule, and execute database-structured queries.

[0096] The GUI dialog window 604 provides a window for a user to monitor or stop queries that are currently running or that are waiting for Data Extraction engines 208 (FIG. 2) to become available for allocation. Status of database-structured queries and other messages are communicated from the Data Extraction Clients 108 to the Data Extraction Services 110, and may be displayed either in the Chat window 602, or optionally in the GUI dialog window 604.

[0097] Chat window 602 is a software program that users may employ to communicate with one another as to status of database-structured queries or Data Extraction engines 208. In a client-server architecture, the Chat window 602 may provide communication with a system administrator.

[0098] A user may launch a web browser 606 through the GUI dialog window 604. The web browser 606 may be employed to provide either a rendered view of a Website 104, or optionally a view of the content from a Website 104. Rendering of a Website 104 in its content structure provides a way for users to view potential tags and patterns useful in the creation of database-structured queries.

[0099] The GUI dialog window 604 communicates with Help and Syntax tools 610, providing a user with on-line documentation and user help instructions. The Help and Syntax tools 610 in conjunction with the Point and Click Query Generator 608 provide a user with tools to create database-structured queries, and check existing database-structured queries for proper syntax format. The Syntax component of the Help and Syntax tools 610 may be employed to create prototypes of functions with descriptions of their parameters, to assist a user in creating complex database-structured queries. The Point and Click Query Generator 608 provides a set of tools to select patterns from HTML pages displayed in a web browser 606.

[0100] The text editor 616 allows a user to create and modify database-structured queries. A user may save database-structured queries for later use by executing the save query 612, and load saved queries by employing the load query 614. The combination of tools then provides a user with the ability to create and save database-structured query scripts for later use or to share them with other users.

[0101] An optional gamma query 618 is shown in this embodiment. The gamma query 618 provides a user with a set of possible patterns to search from within a given Website 104. The gamma query 618 examines selected HTML content that has been passed to it through the GUI dialog window 604 from the Web browser 606. From the HTML content the gamma query 618 creates a template of suggested database-structured queries for the user. The user may employ the text editor 616 to edit the template database-structured query or save it with the save query 612.

Generalized Operation for Data Extraction Requests

[0102] FIG. 3 is a flow diagram illustrating a process for extracting and organizing selected data on Internet sites. Briefly, the data extraction and organization process 300 in FIG. 3 creates a data log of results based on a set of database-structured query clauses developed by a user. The data log results may be reshaped into a predetermined format to make the data available for analysis.

[0103] As shown in FIG. 3, after a start block, the logic flows to block 302 where a user creates a database-structured query request (see FIG. 6 and related discussion). At block 302, the user creates a set of regular expressions that direct the search. The user will typically employ the above described select, from, and where clauses to create the database-structured query. The database-structured query clauses then are passed to block 304.

[0104] At block 304, the user's database-structured query clauses are parsed to determine at which Websites 104 (1 through N) (FIG. 1A) to commence a search, how deep to search a site, and what data to extract. Block 304 then forwards the parsed information to block 306.

[0105] Block 306 uses the parsed information to generate entries into an internal database. The entries may be the result of a request to follow a set of URL links to a specified depth. The entries may optionally be a sequence or increments of URLs based on some algorithm. For example, where a Website may have numerous URLs, a user may write a database-structured query to sequence through the URLs, selecting only the first 20000 URLs. The user may also request a list of keywords that are iterated through, substituting each keyword into a URL. Whichever clause is sent to block 306, an initial list of URLs is created in the database. The process then proceeds to block 308 to use the internal database in searching for the user data patterns (see FIG. 4 and related discussion).

[0106] Briefly described, block 308 employs the internal database created in block 306, and information obtained from block 304, to crawl through the identified Websites 104 (1 through N). Once the requested extraction of data is complete, the process proceeds to block 310.

[0107] At block 310, logged data is reshaped into a predetermined format. The logged data may be reshaped into a plurality of formats. For example, the user may reshape the logged data to make it available to relational database tools, spreadsheets, XML (eXtensible Markup Language) display, and the like. Reshaped data is exported from block 310 to block 312, where the data is analyzed. The data may be analyzed by a plurality of analysis tools. The analysis tools are not limited, and may include commercially available database, spreadsheet, or even statistical analysis tools. Upon completion of the data analysis at block 312, the logical flow ends. A user may repeat process 300 for additional network database-structured queries.
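One possible form of the reshaping at block 310 is sketched below in Python: tab delimited log lines are converted into named records and then written as comma separated text for a spreadsheet. The field names, the sample log, and the function names are hypothetical; the description does not specify the layout of the data log beyond its being tab delimited.

```python
import csv
import io
import json


def reshape_log(log_text, field_names):
    """Turn tab delimited log lines into a list of field-name records."""
    reader = csv.reader(io.StringIO(log_text), delimiter="\t")
    return [dict(zip(field_names, row)) for row in reader if row]


def to_csv(records, field_names):
    """Render records as comma separated text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=field_names, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical logged data: symbol, last trade, change.
    log = "ABC\t12.50\t+0.25\nXYZ\t7.10\t-0.05\n"
    fields = ["symbol", "last", "change"]
    records = reshape_log(log, fields)
    print(to_csv(records, fields))
    print(json.dumps(records))  # the same records in another format
```

Keeping the records in a neutral list-of-dictionaries form allows the same logged data to be exported to more than one format without rescanning the log.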

[0108] FIG. 4 shows one embodiment of a process for crawling non-cached database entries of URLs. Process 400 iterates through an initial database of URLs until the list of Websites, identified by their URLs, is exhausted. As process 400 iterates through the URL list, the content of the network sites is scanned and additional URLs may be added to the database. This typically arises where a user has employed the follow links function within the where clause of a database-structured query.

[0109] After a start block, the logic moves to decision block 402. Decision block 402 determines whether a non-cached URL has been found in the database. When a non-cached URL has been found in the database, the process proceeds to decision block 404. A non-cached URL arises when at least one non-searched URL exists in the database. When a non-cached URL has not been found in the database, the logical flow ends.

[0110] Decision block 404 determines if the requested site requires a user password for login. If the Website is password restricted, the process proceeds to block 406, where a password is employed to log into the site. A user typically specifies the password and user name as part of the site information. Once logged into the site, the process proceeds to block 408. If at decision block 404 no login password is required, the process proceeds to block 408.

[0111] At block 408, the HTML content is retrieved from the Website 104 identified by the URL.

[0112] Moving to block 410, the Website whose HTML content was retrieved according to the logic in block 408 is marked in the database as having been searched. Marking the Website 104 in the database permits the process to resume where it left off in the list of URLs, should the database-structured query be interrupted. Execution of the process then proceeds to block 412.

[0113] Block 412 parses the HTML content to extract and log user-requested data matches (see FIG. 5 and related discussion). Process 400 then proceeds to decision block 414.

[0114] Decision block 414 determines if the HTML content has additional URL links for possible further crawling. If additional URL links are found in the HTML content, the process proceeds to block 416. If no additional URL links are found in the HTML content, the process returns to decision block 402.

[0115] At block 416, additional URLs that were located within the parsed HTML are evaluated for possible addition to the internal database. A URL might be added to the database when a user employed lists or sequences in a where clause. Once the database has been updated, the process returns to decision block 402.

[0116] Process 400 iterates through the list of URL entries in the database until the list is exhausted. Once the list of URL entries is exhausted, the logical flow ends.
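The loop of process 400 can be sketched in Python as follows. This is only an illustrative reading of FIG. 4 under several assumptions: the internal database is a dictionary mapping each URL to a searched flag, page retrieval is delegated to a caller-supplied `fetch` function, credentials are keyed by URL for simplicity, and links are located with a naive `href` pattern. None of these names appear in the specification, and no depth limit is enforced here.

```python
import re

# Naive link pattern, assumed for illustration; real HTML needs a proper parser.
LINK_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def crawl(database, fetch, credentials=None, follow_links=False, on_content=None):
    """Iterate over non-searched URLs until none remain (process 400).

    database:    dict mapping URL -> True once the URL has been searched
    fetch:       callable(url, login) -> HTML text (hypothetical retrieval hook)
    credentials: optional dict mapping URL -> (user name, password)
    on_content:  optional callable(url, html) standing in for block 412
    """
    credentials = credentials or {}
    while True:
        # Decision block 402: look for a URL that has not been searched yet.
        url = next((u for u, done in database.items() if not done), None)
        if url is None:
            break
        # Decision block 404 / block 406: use a login if one was supplied.
        login = credentials.get(url)
        # Block 408: retrieve the HTML content.
        html = fetch(url, login)
        # Block 410: mark the URL so an interrupted query can resume.
        database[url] = True
        # Block 412: parse, extract, and log (see FIG. 5).
        if on_content is not None:
            on_content(url, html)
        # Decision block 414 / block 416: add newly found links.
        if follow_links:
            for link in LINK_RE.findall(html):
                database.setdefault(link, False)
    return database
```

Because each URL is marked before its content is parsed, rerunning `crawl` on a saved database skips every entry already flagged as searched.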

Optional Data Selection and Extractions

[0117] FIG. 5 illustrates a logical flow for parsing HTML content for data selection and extraction. After a start block, the logical flow moves to block 502, where HTML content is reduced to a region of interest. The remaining HTML content contains a region of interest that may provide matches to patterns identified in the database-structured queries. A region of interest may be any line of HTML content that is likely to include HTML metatags, tables, images, or links to other Websites. Generally, comment lines, format tags such as paragraph tags and italics tags, and other similar HTML code do not include this information. Process 500 then continues to decision block 504.
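A line-based reduction of the kind described for block 502 might look like the following Python sketch. The specific tag lists and the function name `region_of_interest` are assumptions chosen for illustration. The specification does not enumerate which tags are kept or discarded, and this heuristic drops any line that carries none of the listed tags.

```python
import re

# Assumed tag sets for illustration only.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
FORMAT_TAG_RE = re.compile(r"</?(?:p|i|b|em|strong|font|br)\b[^>]*>", re.IGNORECASE)
INTEREST_RE = re.compile(r"<(?:meta|table|tr|td|th|img|a)\b", re.IGNORECASE)


def region_of_interest(html):
    """Return the lines likely to hold metatags, tables, images, or links."""
    html = COMMENT_RE.sub("", html)           # drop comment lines
    kept = []
    for line in html.splitlines():
        line = FORMAT_TAG_RE.sub("", line).strip()  # drop format tags
        if line and INTEREST_RE.search(line):
            kept.append(line)
    return kept
```

Lines that contain only formatted text, such as a paragraph with italics, are discarded, while lines carrying a metatag, an image, or a link survive with their format tags removed.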

[0118] Decision block 504 determines whether there is a request to parse HTML tables. This may be determined by the existence of an HTML table function within a select clause. When the select clause does employ an HTML table function, the process proceeds to block 506. Typically, an HTML table is used for lists, specifications and other tabular data as well as to locate elements on the page. Because the table command gives the HTML designer reasonably precise control over the layout of text and images, some of the more relevant information to the user's search may be stored in tables. At block 506, the data is extracted from HTML tables. A variety of ways may be employed to extract data from the HTML tables. Once relevant data is extracted from the HTML content, the process proceeds to block 508, where the extracted HTML table data is saved to a data log. The data log may be a flat file such as a tab delimited text file. Optionally, data may be printed to the user's display screen. The process 500 then proceeds to decision block 510.
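As one of the "variety of ways" to extract table data, the following Python sketch uses the standard library `html.parser` module to collect cell text row by row and emit tab-delimited lines. The class and function names are hypothetical, and nested tables are not handled. It is offered as an assumption-laden illustration of blocks 506 and 508, not as the method of the specification.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse internal whitespace so each cell fits on one log line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html):
    """Block 506: return all table rows found in the HTML content."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def to_tab_delimited(rows):
    """Block 508: format rows as lines of a tab-delimited data log."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

The text returned by `to_tab_delimited` can be appended to a log file or printed to the display screen.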

[0119] If decision block 504 determines that there is no request to parse HTML tables (i.e., no table functions were employed within the select clause), the process proceeds to decision block 510.

[0120] Decision block 510 determines if there is a matching pattern found in the HTML content. Patterns of interest to the user may arise from how Website pages are grouped or linked to each other as well as how data is displayed on a Web page in HTML. A pattern may be found when it matches the regular expression defined in a database-structured query request. When a pattern is found in the HTML content, such as one matching the regular expression provided in the select clause, the process proceeds to block 512.

[0121] At block 512, the data matching the pattern is extracted from the HTML content. The process continues to block 514 where the extracted data is saved to a data log file. The process then proceeds to decision block 516.

[0122] When decision block 510 determines that there is no pattern match found in the HTML content, the process proceeds to decision block 516.
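The pattern matching of blocks 510 through 514 can be sketched in Python with the standard `re` module. The function names and the log layout (the source URL followed by the captured fields, tab-delimited) are assumptions for illustration. The sample pattern is likewise invented and is not taken from the specification.

```python
import re


def extract_matches(pattern, html):
    """Blocks 510/512: return one tuple of fields per regular expression match.

    If the pattern has capture groups, the groups are the fields;
    otherwise the whole match is the single field.
    """
    results = []
    for match in re.finditer(pattern, html):
        fields = match.groups() or (match.group(0),)
        # A group that did not participate in the match is logged as empty.
        results.append(tuple("" if f is None else f for f in fields))
    return results


def log_matches(url, matches, log_file):
    """Block 514: append each match to a tab-delimited data log."""
    for fields in matches:
        log_file.write("\t".join((url,) + fields) + "\n")
```

When no match is found, `extract_matches` returns an empty list and nothing is written to the log, which corresponds to proceeding directly to decision block 516.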

[0123] Decision block 516 determines if there is a request to download binaries. A binary file download might request downloads of graphical images stored in .JPEG or .GIF formats. A binary file download request might optionally be made to download audio files stored in the MP3 audio compression format. When binary files are identified for downloading, the process 500 proceeds to block 518.

[0124] At block 518, binary files that match the database-structured query request are downloaded. The process proceeds to block 520 to save the extracted (downloaded) files to a specified location. The logical flow then ends.

[0125] At decision block 516, when the user does not specify binary files for download, or no matches are found in the HTML content, the logical flow ends.
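A minimal Python sketch of blocks 516 through 520 follows. It assumes binaries are identified by file extension in `src` and `href` attributes, that relative references are resolved against the page URL, and that retrieval is delegated to a caller-supplied `fetch_bytes` function. These names and the extension list are hypothetical and serve only to illustrate the flow.

```python
import os
import re
from urllib.parse import urljoin, urlparse

# Naive attribute pattern, assumed for illustration.
REF_RE = re.compile(r'(?:src|href)="([^"]+)"', re.IGNORECASE)


def find_binaries(html, base_url, extensions=(".jpeg", ".jpg", ".gif", ".mp3")):
    """Block 516: list absolute URLs whose path ends in a requested extension."""
    found = []
    for ref in REF_RE.findall(html):
        url = urljoin(base_url, ref)
        if urlparse(url).path.lower().endswith(extensions) and url not in found:
            found.append(url)
    return found


def save_binaries(urls, fetch_bytes, directory):
    """Blocks 518/520: download each file and save it to the given location."""
    os.makedirs(directory, exist_ok=True)
    saved = []
    for url in urls:
        name = os.path.basename(urlparse(url).path)
        path = os.path.join(directory, name)
        with open(path, "wb") as handle:
            handle.write(fetch_bytes(url))
        saved.append(path)
    return saved
```

If `find_binaries` returns an empty list, nothing is downloaded and the flow ends, as in the final branch of decision block 516.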

[0126] The above specification, examples and data provide a complete description of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.

We claim:
 1. A method for extracting data from a network, comprising: (a) creating a database-structured query; (b) determining a web domain address on the network from which to extract the data, the web domain address having content; and (c) extracting data from the determined web domain address based on the database-structured query.
 2. The method of claim 1, wherein creating the database-structured query, further comprises, including a network address within the database-structured query indicating a starting point.
 3. The method of claim 2, wherein the web domain address, includes at least one universal resource locator (URL).
 4. The method of claim 2, wherein the web domain address, further comprises, following links contained within the web domain until the links have been exhausted or following the links until a predetermined limit is reached.
 5. The method of claim 1, wherein creating the database-structured query, further comprises, creating a regular expression within the database-structured query used to determine the data to extract.
 6. The method of claim 5, wherein extracting data from the determined web domain address based on the database-structured query, further comprises, matching a plurality of patterns contained within the regular expression to the content to determine the data to extract.
 7. The method of claim 1, wherein creating the database-structured query, further comprises, creating a conditional expression within the database-structured query describing how to scan the content for the data to extract.
 8. The method of claim 1, wherein the extracting data from the determined web domain, further comprises: (b) retrieving content from the web domain address; (c) reducing the retrieved content to a region of interest; and (d) searching the region of interest for the data matching a predetermined regular expression.
 9. The method of claim 8, wherein extracting the data from the determined web domain, further comprises, storing the data matching the predetermined regular expression.
 10. The method of claim 9, wherein extracting the data from the determined web domain, further comprises, reshaping the stored data by arranging the stored data for at least one data analysis software program.
 11. A computer-readable medium having computer-executable instructions for extracting data from a network comprising: (a) creating a database-structured query including a web domain address used for locating content; (b) locating the content based on the web domain address; and (c) extracting data based on the database-structured query from the located content.
 12. The computer-readable medium of claim 11, wherein the database-structured query, further comprises, a network address included within the database-structured query indicating a starting point.
 13. The computer-readable medium of claim 12, wherein the network address, further comprises at least one universal resource locator (URL).
 14. The computer-readable medium of claim 11, wherein the web domain address, further comprises, links contained within the web domain to be followed until the links have been exhausted or until a predetermined limit is reached.
 15. The computer-readable medium of claim 11, wherein the database-structured query, further comprises, a regular expression within the database-structured query used to determine the data to extract.
 16. The computer-readable medium of claim 15, wherein the regular expression within the database-structured query, further comprises, a plurality of patterns used to determine the data to extract from the web domain address having content.
 17. A system for extracting data from a network comprising: (a) a client computer system having a client network connection to the network and communicating with a server computer system, the client creating a database-structured query; (b) the server computer system having a server network connection to the network and communicating with the client computer system, the server determining a web domain address from which to extract the data from based on the database-structured query.
 18. The system of claim 17, wherein the database-structured query, further comprises, a network address within the database-structured query indicating a starting point.
 19. The system of claim 18, wherein the database-structured query, further comprises, a regular expression within the database-structured query used to determine the data to extract.
 20. The system of claim 19, wherein the regular expression within the database-structured query, further comprises, a plurality of patterns used to determine the data to extract from the web domain address having content.
 21. The system of claim 17, further comprising an editor for creating a template of regular expressions used to extract the data.
 22. The system of claim 17, further comprising at least one data extraction engine to extract the data.
 23. The system of claim 22, wherein the data extraction engine is a web crawler.